Scanning 2.5 million domains - part 1

Hello friends!

Recently I've talked with some work colleagues about public .git repositories and other obvious, low-skill attacks on domains.
Some of you might not be aware of it, but the .ch zone (every .ch domain, like admin.ch) is open data and can be freely downloaded from switch.ch; also check out the Compass lookup tool.
I'm a boring person with no life, so I've decided to dive a little into the .ch zone.
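
If you want to try this yourself, a minimal sketch of pulling the zone over AXFR with dnspython could look like the block below. The server name zonedata.switch.ch and the exact setup are assumptions on my part; check switch.ch for the current endpoint and the terms of use.

# Sketch: fetch the .ch zone over AXFR with dnspython (pip install dnspython).
# Assumption: the zone is served via public AXFR from zonedata.switch.ch;
# check switch.ch for the current endpoint and the terms of use.
import dns.query
import dns.zone

def fetch_ch_zone(server="zonedata.switch.ch"):
    # Careful: the full .ch zone is huge and this loads it into memory.
    zone = dns.zone.from_xfr(dns.query.xfr(server, "ch."))
    return sorted(str(name) for name in zone.nodes)

if __name__ == "__main__":
    names = fetch_ch_zone()
    print(f"{len(names)} names in the zone")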

This blog will be a multipart series where I'll take a look at different security settings and their impact on our domains.

All in all, we're talking about 2.5 million domains (not all of which have an HTTP/HTTPS server listening on them).

What to check

First off, I decided on a work plan:

  • What needs to be checked
  • How deep can I go without breaking the law?
  • Do I even care?

I created the following JSON skeleton, which includes everything I'll check.

{
	"domain": "musterdomain.ch",
	"ip": "127.0.0.1",
	"time_of_req": "2021-05-24 14:58:37.304529",
	"robots": {
		"http": false,
		"https": false
	},
	"svn": {
		"http": false,
		"https": false
	},
	"git": {
		"http": false,
		"https": false
	},
	"http_banner": [
		{
			"server": "nginx",
			"content-type": "text/html",
			"date": "Thu, 13 May 2021 17:29:12 GMT",
			"x-xss-protection": null,
			"content-security-policy": null,
			"content-origin-embedder-policy": null,
			"cross-origin-opener-policy": null,
			"cross-origin-resource-policy": null
		}
	],
	"https_banner": [
		{
			"server": "nginx",
			"content-type": "text/html",
			"date": "Thu, 13 May 2021 17:29:12 GMT",
			"x-xss-protection": null,
			"content-security-policy": null,
			"content-origin-embedder-policy": null,
			"cross-origin-opener-policy": null,
			"cross-origin-resource-policy": null
		}
	],
	"certificate": [
		{
			"san": [
				"san1",
				"san2"
			],
			"expired": false,
			"subject": "CN=musterdomain"
			"until": "2021-07-08 04:21:47",
			"issuer": "CN=R3,O=Let's Encrypt,C=US",
			"issued_to": "CN=musterdomain.ch"
		}
	],
	"security.txt": {
		"http": false,
		"https": false
	}
}

This includes public .git repositories (clean them up plz), public .svn repositories (why does this even still exist?) and other stuff.
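
To make that a bit more concrete, here's a rough sketch of how a single domain could be probed for a few of these paths over both HTTP and HTTPS. The path list, timeout and function name are illustrative, not the exact logic of my script.

# Sketch: probe a single domain for a few of the checked paths.
# Paths, timeout and function name are illustrative, not the real script.
import requests

PATHS = {
    "git": "/.git/HEAD",
    "svn": "/.svn/entries",
    "robots": "/robots.txt",
    "security.txt": "/.well-known/security.txt",
}

def probe(domain, timeout=5):
    result = {"domain": domain}
    for key, path in PATHS.items():
        result[key] = {}
        for scheme in ("http", "https"):
            try:
                r = requests.get(f"{scheme}://{domain}{path}",
                                 timeout=timeout, allow_redirects=False)
                result[key][scheme] = r.status_code == 200
            except requests.RequestException:
                result[key][scheme] = False
    return result

print(probe("musterdomain.ch"))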

I am devloper.

Since I like to create my own stuff, I decided against using already established tools like wfuzz.
Although in hindsight it would have saved me a lot of headaches...

So I put on my programming socks and went to town.

I am devloper?

My first approach was to use a Python script with multithreading support (scan every domain in its own thread).
Turns out that this was a terrible idea and in no way usable.
This approach would have taken me almost a year to scan everything: the requests module is blocking, which wouldn't be that bad in a multithreaded application, but I wanted to write everything into a single file, which had to be locked for each write to be thread-safe.
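
Heavily simplified, that first version looked something like the sketch below (domain list, probed path and timeout are placeholders): a blocking request per thread plus one lock-protected output file.

# Simplified sketch of the threaded idea: blocking requests in threads plus a
# single output file guarded by a lock, so every write serializes the workers.
import json
import threading
import requests

write_lock = threading.Lock()

def scan(domain, outfile):
    try:
        r = requests.get(f"http://{domain}/robots.txt", timeout=5)  # blocks the thread
        entry = {"domain": domain, "robots": r.status_code == 200}
    except requests.RequestException:
        entry = {"domain": domain, "robots": False}
    with write_lock:  # contention point: one writer at a time
        outfile.write(json.dumps(entry) + "\n")

with open("results.jsonl", "a") as f:
    threads = [threading.Thread(target=scan, args=(d, f))
               for d in ["musterdomain.ch", "admin.ch"]]
    for t in threads:
        t.start()
    for t in threads:
        t.join()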

I am devloper!

My next approach was to use Python with asyncio.
This turned out to be the right one. My initial tests to scan 1000 domains took around 4-7 minutes, depending on the response times.
As this was still too slow, I incorporated a hacky queue system to not overload my RAM (think about the children) and awaited the whole queue before writing everything to a file.
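
Stripped down to the core idea, the async version looks roughly like this; worker count, queue size and the probed path are illustrative, the real script checks everything from the JSON skeleton above.

# Sketch of the async approach: a bounded queue of domains, a fixed pool of
# workers, results collected in memory and written out once at the end.
import asyncio
import json
import aiohttp

async def worker(queue, session, results):
    while True:
        domain = await queue.get()
        try:
            async with session.get(f"http://{domain}/robots.txt",
                                   timeout=aiohttp.ClientTimeout(total=5)) as r:
                results.append({"domain": domain, "robots": r.status == 200})
        except Exception:
            results.append({"domain": domain, "robots": False})
        finally:
            queue.task_done()

async def main(domains, workers=50, queue_size=100):
    queue = asyncio.Queue(maxsize=queue_size)  # bounded, so RAM stays sane
    results = []
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(worker(queue, session, results))
                 for _ in range(workers)]
        for domain in domains:
            await queue.put(domain)  # blocks when the queue is full
        await queue.join()  # wait until every queued domain has been handled
        for t in tasks:
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    with open("results.json", "w") as f:
        json.dump(results, f)

asyncio.run(main(["musterdomain.ch", "admin.ch"]))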

Queue and worker size

I ran the script with different queue and worker sizes to determine which settings would give me the most speed.

Workers | Queue | #1 Speed (s) | #2 Speed (s) | #3 Speed (s) | Print
50      | 50    | 304          | 293          | 227          | yes
100     | 100   | 443          | 185          | 190          | yes
50      | 100   | 199          | 187          | 188          | no
100     | 150   | 302          | 160          | 169          | no

If you take a look at the 50/100 run's #1 speed you might ask yourself: "wth, why is it so fast?"
As you might know (or not), GNU/Linux does not employ a DNS cache, so that can't be it.
My bind9, on the other hand, does.

In the end I decided to go with 50/100, because it doesn't really matter that much.

SANIC!

Test cases with 1000 domains took around 180 seconds to scan fully.
While that's fast AF, it's not really a real-world example, as CPU and memory management might also become a bottleneck.


If I run it again, I'll just pay a few bucks and get myself a loaded Azure VM.

I am devopser!

Finally, to speed things up, I put the script in a Docker image, split the zone and ran the containers side by side.
The only time sink I couldn't remove was my limited hardware (RIP Intel NUC).
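
Splitting the zone into chunks for the containers is the easy part; something along these lines (file names are made up):

# Sketch: split the domain list into N chunks, one file per container.
# File names are illustrative.
def split_zone(infile="domains.txt", parts=4):
    with open(infile) as f:
        domains = [line.strip() for line in f if line.strip()]
    chunk = (len(domains) + parts - 1) // parts
    for i in range(parts):
        with open(f"domains_{i}.txt", "w") as out:
            out.write("\n".join(domains[i * chunk:(i + 1) * chunk]))

if __name__ == "__main__":
    split_zone()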

I'll put the script in my public GitHub repository.

What's next?

The next post will focus on the public-facing .git and .svn directories.